ABSTRACT

This paper presents a critique of the Northwest
Regional Educational Laboratory's (N.W.R.E.L.) review of the
Mat-Sea-Cal Oral Proficiency Tests in their publication, Oral
Language Tests for Bilingual Students. That publication was released
in July, 1976 as a guide to administrators and program coordinators
in the selection of instruments for assessing students' language
dominance and oral proficiency (-ies). In rating each instrument,
four criteria were explored: measurement validity, examinee
appropriateness, technical excellence, and administrative usability.
Several questions within each criterion were examined in determining
the overall criterion rating. A descriptive review of the Mat-Sea-Cal
is presented and the reviewers' rating is summarized in a chart. This
critique scrutinizes the evaluations rendered to the Mat-Sea-Cal by
the reviewers in each of the four criteria. Discussion is offered on
several points. Differences in perception between the author and the
N.W.R.E.L. reviewers on the evaluation of the Mat-Sea-Cal are
enumerated. (Author/CFM)
INTRODUCTION

This paper presents a critique of the Northwest Regional
Educational Laboratory's review of the Mat-Sea-Cal Oral Proficiency
Tests in their publication, Oral Language Tests for Bilingual Students.
That publication was released in July, 1976 as a guide to administrators
and program coordinators in the selection of instruments for assessing
students' language dominance and oral proficiency(-ies).

The N.W.R.E.L. reviewers, in rating each instrument, explored four
criteria: Measurement Validity, Examinee Appropriateness, Technical
Excellence, and Administrative Usability. Several questions within each
criterion were examined in determining the overall criterion rating.

A descriptive review of the Mat-Sea-Cal is presented on pages 101-06
of the publication. The reviewers' rating is summarized in a chart on
pages 126-27 of the booklet.

This critique scrutinizes the evaluations rendered to the Mat-Sea-Cal
by the reviewers in each of the four criteria. Discussion is offered on 
several points. Differences in perception between this author and the 
N.W.R.E.L. reviewers on the evaluation of the Mat-Sea-Cal are enumerated. 

In citing references, where only page numbers are given, the statement
is attributed to the N.W.R.E.L. publication. Where outside sources
are quoted, the standard author-title-page number format is followed.



THE CRITIQUE

This critique follows the outline presented in the N.W.R.E.L.
booklet in chapter three: Evaluative Criteria. This discussion,
therefore, begins with Measurement Validity, followed by Examinee
Appropriateness, Technical Excellence, and Administrative Usability,
in that order.

CRITERION: MEASUREMENT VALIDITY 

Seven questions (pp. 30-2) were considered in determining an
instrument's measurement validity. However, ratings were given for
only two categories (pp. 126-27). Judging from the evaluation chart
point scale (p. [illegible]), questions #a through #e (pp. 30-2) appear
to have been combined into the category "content and construct" (p. 126).
The second category within this criterion, "concurrent and predictive",
apparently is composed of questions #f and #g (p. 32).

Of note, the authors of the N.W.R.E.L. publication have presented
no rationale within the text for condensing seven discrete questions
into two evaluation categories.

Content and Construct. In this category the Mat-Sea-Cal received
five of a possible seven points. This was the highest rating achieved
by any of the eleven instruments reviewed. In fact, the Mat-Sea-Cal was
the only instrument to be awarded over half of the maximum possible
points allotted for content and construct validity (p. 126).

This rating is significant in psychometric terms. Content and
construct validity are deemed the initial, critical stages of instrument
development. Content validity addresses the issue of whether an
instrument samples the universe it purports to measure. Construct validity
explores the question of whether performance on the sampled items reflects
an accurate measure of the respondee's knowledge of, or competence in,
the theoretical constructs being tested (Cronbach, in E. L. Thorndike,
Educational Measurement, p. 446).

As the initial phase of instrument development, demonstration of
content and construct validity is, therefore, a prerequisite to conducting
further analyses. That the Mat-Sea-Cal was reviewed favorably by an
independent agency, N.W.R.E.L., is a manifestation of the instrument's
quality.

Concurrent and Predictive. The Mat-Sea-Cal was awarded no points for
concurrent and predictive validity in the N.W.R.E.L. publication (p. 126).
Concurrent and predictive validity are both correlation analyses between
the instrument under development and a measure of criterion proficiency.

Specifically, concurrent validity entails the correlation of the test
and the outcome data at very nearly the same time. The correlation between
the two is the measure of concurrent validity. Emphasis is placed on
selecting an appropriate outcome measure, as the new instrument will be
correlated with "whatever the outcome measure tests" (Cronbach,
Essentials of Psychological Testing, pp. 104, 108, and 117; see also
Thorndike, p. 484).
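
The correlation described above can be illustrated with a minimal
sketch. It assumes the ordinary Pearson product-moment coefficient is
the statistic intended; the sources cited here do not name the formula,
and the score lists are hypothetical values invented for illustration,
not Mat-Sea-Cal or S.R.A. data.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations about the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a list has no variance")
    return sxy / sqrt(sxx * syy)


# Hypothetical scores for six examinees tested at nearly the same time.
new_instrument = [12, 15, 9, 20, 17, 11]
outcome_measure = [30, 34, 25, 41, 40, 27]

r = pearson_r(new_instrument, outcome_measure)
print(f"concurrent validity coefficient: {r:.2f}")
```

The coefficient says only how closely the new instrument tracks the
chosen outcome measure, which is why the selection of that measure
carries so much weight.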

The primary interest of predictive validity is finding an accurate
measure of a future outcome. A score on the instrument under development
is checked against a criterion measure. The aim of testing is to predict
this criterion, and the merit of the instrument is judged by the accuracy
of its prediction. Normally, several months' time elapse between
testing and data gathering on the criterion measure. Success in
predicting the criterion usually results in a statistical decision
theory model. That is, a formula is used in assigning students to
different treatments in the future, based on the predictive validity
findings. Here too, the practical "worth" of the decision formulas
rests on the selection of a valid criterion measure of the competence
or performance on which prediction is desired (Cronbach, pp. 108 and
117; see also Thorndike, pp. 303, 443-44, and 484).

Related to concurrent validity, the N.W.R.E.L. review noted that
significant correlations (at the .01 level) were established between
the Mat-Sea-Cal and the S.R.A. Achievement Survey (pp. 104-05).
These findings were originally published in Dr. Matluck's A.E.R.A.
paper in April, 1976 (Matluck and Matluck, The Mat-Sea-Cal Instruments
for Assessing Language Proficiency, p. 10). Inspection of Dr. Matluck's
sources reveals that these correlations were largely between .30 and
.70. Based on the N.W.R.E.L. rating scale (p. 32) the Mat-Sea-Cal
would, therefore, deserve one point for concurrent validity.

Since no points were awarded (p. 126), one must conclude that the
reviewers did not pursue Dr. Matluck's original sources. As his
A.E.R.A. paper was delivered in early April and the N.W.R.E.L. publication
not released until July, ample time for checking sources existed.
In the light of such shortcomings, doubt must be raised with the reviewers'
stated intent that, "an effort was made to obtain as much descriptive
material as possible on each instrument" (p. 44).

Regarding predictive validity, the Mat-Sea-Cal was again
awarded no points by the reviewers. To date, no a priori predictive
investigations with the instrument have been undertaken. Thus, this
rating appears justified.

However, discussion may proceed from whether the demonstration
of predictive validity is within the scope of field test instruments.
Predictive validity is usually the final hurdle of instrument
development. It is undertaken after other investigations (e.g., reliability,
item analysis, concurrent validity, test revision, etc.) have proven
successful. In fact, demonstration of predictive validity indicates
that an instrument is ready for commercial distribution.

Summary (Measurement Validity). The Mat-Sea-Cal Tests were awarded
five of eleven points for measurement validity. The reviewers thereby
classified the Tests as "poor" on this criterion.

Two issues belie this rating. First, in the concurrent validity
section the reviewers failed to pursue source documents. As a result,
they overlooked significant correlations which would have entitled the
Mat-Sea-Cal to an additional rating point.

The second question pertains to whether field-test instruments should
be rated on predictive validity in a manner similar to commercial tests.
Demonstrating predictive validity would appear to be more the domain of
commercially marketed measures.

In sum, the Mat-Sea-Cal should be credited with six rating points
for measurement validity. Then, employing the nine or eleven point
scale (excluding or including the predictive validity rating), the
instrument should be reclassified as "fair" on this criterion,
or "good" on a nine-point scale.
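
The arithmetic behind this reclassification can be laid out in a short
sketch. It shows only the share of the maximum earned under each
scenario. The cutoffs that the N.W.R.E.L. publication uses to separate
"poor", "fair", and "good" are not reproduced in this critique, so none
are assumed here; the function name is hypothetical.

```python
from fractions import Fraction


def share_of_maximum(points, maximum):
    """Return points earned as an exact fraction of the maximum possible."""
    if maximum <= 0 or not 0 <= points <= maximum:
        raise ValueError("points must lie between 0 and a positive maximum")
    return Fraction(points, maximum)


# Rating as published: five of eleven points.
as_rated = share_of_maximum(5, 11)
# One point restored for concurrent validity, eleven-point scale.
corrected_eleven = share_of_maximum(6, 11)
# Same six points with the predictive validity rating excluded.
corrected_nine = share_of_maximum(6, 9)

for label, value in [
    ("as rated (5 of 11)", as_rated),
    ("corrected (6 of 11)", corrected_eleven),
    ("corrected (6 of 9)", corrected_nine),
]:
    print(f"{label}: {float(value):.2f}")
```

Restoring the single point moves the instrument from under half of the
maximum to over half, and dropping the predictive validity rating
raises the share to two thirds.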

CRITERION: EXAMINEE APPROPRIATENESS

Thirteen questions (pp. 33-6) were considered in determining the
examinee appropriateness of an instrument. Points were awarded for
twelve of the considerations. No rating (only a description) was given
for "mode of examinee response," and no reasons were tendered for this
exception.

Like the measurement validity section, some questions were combined
into single rating categories (of which there are nine). Questions #b
and #c (pp. 33-4) form the category "item relevance" (p. 126). Questions
#d, #e, #f, and #g make up the category "instructions." The remaining
questions are retained as individual evaluation entities. No discussion is
provided as to how and why some queries were aggregated, while others
were retained as individual items.

As with the previous criterion (measurement validity), the maximum
points awarded per category in examinee appropriateness vary. The point
scale ranges from zero to four in some instances, while a zero-one
alternative is the choice in others. Are the concerns rated in one category
four times as important as those rated in another? The reviewers provide
no enlightenment.

Summary (Examinee Appropriateness). The Mat-Sea-Cal Tests received
fourteen of fifteen possible points on the criterion of examinee
appropriateness. This rating earned the instrument a classification of
"good" for this criterion.

Only in failing to require test administrators to inform examinees
of the test's purposes (i.e., "justification") did the Mat-Sea-Cal
not receive the maximum points in any category. This rating preference
assumes that primary-age children's performance is positively
affected by knowing why they are being tested. In actual settings
this information most likely motivates some youngsters, while creating
anxiety in others.

CRITERION: TECHNICAL EXCELLENCE

Four questions (pp. 37-8) were evaluated in determining an
instrument's technical excellence. Unlike each of the previous two
criteria (validity and appropriateness), each question was rated as
a separate entity. However, the maximum point value within each category
varied from one to three points. Again, these scale differentials
remain unexplained.

Several points related to the evaluation of technical excellence
need to be made. First, the category of "replicability" appears to be
an administrative matter (the next criterion) rather than a technical
concern.

Of a more serious nature is the reviewers' collective knowledge
of the concept of reliability. Their bias favors instruments capable
of simultaneously exhibiting three types of reliability: alternate
form, test-retest and internal consistency. In doing so, the N.W.R.E.L.
reviewers failed to address whether each reliability type was congruent
in nature to that of the instruments being evaluated. Also, under certain
circumstances, one type of reliability computation yields coefficients
virtually identical to those calculated by a second method (Guilford
and Fruchter, Fundamental Statistics in Education and Psychology, p. 410).
As another example, the creation of an alternative form for the purpose
of presenting a second reliability coefficient is not a practice
advocated by educational psychometricians (Stanley, in Thorndike,
pp. 404-5). Such state-of-the-art positions, however, have been
ignored by the reviewers in listing their reliability ratings for
instruments.

Alternate Form. The Mat-Sea-Cal was awarded no points for alternate
form reliability. As only one form of the instrument (per language)
exists, this was expected.

However, with power tests alternate form and internal consistency
estimates of reliability "can be used almost interchangeably" (Guilford
and Fruchter, p. 410). Power tests are those in which examinees have
ample time to answer all questions. (Standard educational measurement
texts list specific requirements - Thorndike, p. 192; see also Guilford
and Fruchter, pp. 406-07.) By instruction and as demonstrated in
actual administration, the Mat-Sea-Cal permits sufficient time for all
examinee responses. Thus, the rating of the Mat-Sea-Cal for alternate
forms duplicates the evaluation for internal consistency (which is
detailed later).

Furthermore, construction of an alternate form solely to demonstrate
a second reliability coefficient would be an unnecessary depletion of
test development resources. Creation of a second form places additional
requirements on test development.



First, the amount of variation in content and format between forms
must be skillfully balanced, if they are to be truly comparable. If
the alternate forms differ too distinctly, the correlation between them
will underestimate the desired reliability. By contrast, if the two forms
overlap to an excess, the obtained correlation will overestimate the forms'
reliability (Stanley, in Thorndike, pp. 404-05).

In addition, care must be taken to insure that items selected for
the "second" form are representative of the respective universe. Further,
the manner in which items are chosen for inclusion in the "alternate" form
must be equivalent, and not reflect increased skill in item writing.
Violation of either requirement would have a deleterious effect on the
instrument's content and construct validity (Ibid.).

Finally, item statistics and correlations would need to be comparable.
Therefore,

"if only a single form of a test is needed for
the research or practical use to which the test
is to be put, it seems unduly burdensome to
prepare two separate tests in order to obtain
an estimate of reliability" (Ibid., pp. 405-08).

Another type of alternate form reliability is the "instant
readministration", or split-half technique. Here, the instrument is divided
into halves (randomly, or in-pattern), administered once; but a second
administration is considered to have occurred "instantaneously." The two
halves are separately scored, then correlated. However, as reliability
is a function of test length (to a point), the obtained coefficient is
spuriously low. It is, therefore, adjusted by the Spearman-Brown
estimation formula. In addition to being an estimate of an underestimate, the
split-half coefficient is regarded as a one-form reliability correlation
(Ibid., p. 369).
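
The length adjustment mentioned above can be sketched as follows. This
is the standard Spearman-Brown formula, r' = k r / (1 + (k - 1) r),
where r is the correlation between the two halves and k is the factor
by which the test is lengthened (k = 2 for the split-half case). The
function name and the half-test correlations are hypothetical; they are
not Mat-Sea-Cal results.

```python
def spearman_brown(r_half, length_factor=2.0):
    """Estimate reliability of a test `length_factor` times as long.

    r_half is the correlation between the two half-tests. The sketch
    accepts only correlations between 0 and 1, the usual case.
    """
    if not 0.0 <= r_half <= 1.0:
        raise ValueError("r_half must lie between 0 and 1")
    if length_factor <= 0:
        raise ValueError("length_factor must be positive")
    return length_factor * r_half / (1.0 + (length_factor - 1.0) * r_half)


# Hypothetical half-test correlations, stepped up to full length.
for r_half in (0.50, 0.80):
    full = spearman_brown(r_half)
    print(f"half-test r = {r_half:.2f} -> full-length estimate = {full:.2f}")
```

The stepped-up value is always at least as large as the half-test
correlation, which is the sense in which the raw split-half coefficient
is "spuriously low."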

In sum, an evaluation of the Mat-Sea-Cal for alternate form reliability
appears unnecessary. The instrument exhibits high standards for reliability
on an internal consistency measure (as will be documented shortly). The
instrument is a power test; therefore its internal consistency coefficients
would be comparable to alternate form computations. Thus, though the
instrument in its present form cannot show alternate form coefficients,
this cannot be deemed detrimental to its overall psychometric quality.

Test-Retest. The Mat-Sea-Cal received no points for test-retest
reliability from the reviewers. No test-retest studies with the
instrument have been conducted to date; thus, the rating is as expected.

However, it should be noted that the test-retest technique is not
readily applicable to the Mat-Sea-Cal. The test-retest technique indicates
the stability of examinee responses over time. High test-retest coefficients
are associated with the rank ordering of examinee scores on the tested
constructs remaining fairly constant.

Obtaining a retest coefficient with a one form instrument that
measures oral proficiency would be difficult. If several months (or
even weeks) elapsed between the two administrations, score differences
are likely to be confounded by the effects that schooling and individual
maturational patterns have on children. In essence, the reliability
coefficient would be affected to an unknown degree by factors beyond the
testing situation (Ibid., p. 407; see also Guilford and Fruchter, pp.
407-08).

On the other hand, allowing only a short interval between administrations
(e.g., a few days) introduces a memory effect. The examinees are likely to
recall specific questions and their responses to them. In such instances,
it is recommended that the test-retest procedure be avoided (Thorndike,
pp. 407-08).

Thus, for technical reasons the Mat-Sea-Cal's reliability should not
be computed by the test-retest method. The respondees' performance on
the instrument would not likely remain static over a long interval. By
contrast, a short test-retest cycle introduces a memory effect.

Internal Consistency. The Mat-Sea-Cal received zero points for
an internal consistency rating on the N.W.R.E.L. evaluation chart (p. 127).
Internal consistency was to be demonstrated by either a split-half
technique or by a Kuder-Richardson formula coefficient (p. 37).
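
For readers unfamiliar with the Kuder-Richardson approach, a minimal
sketch follows. It assumes formula 20 (KR-20), computed on items scored
right or wrong, with the population form of the total-score variance;
neither the N.W.R.E.L. criterion nor the Mat-Sea-Cal reports cited here
state which Kuder-Richardson formula was used. The response matrix is
hypothetical.

```python
def kr20(responses):
    """Kuder-Richardson formula 20 for dichotomously scored items.

    responses: one row per examinee, each row a list of 0/1 item scores.
    """
    n = len(responses)
    if n < 2:
        raise ValueError("need at least two examinees")
    k = len(responses[0])
    if k < 2 or any(len(row) != k for row in responses):
        raise ValueError("need at least two items, same count in every row")
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n
    # Population variance of total scores (divide by n); an assumption.
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    if var_total == 0:
        raise ValueError("KR-20 is undefined when total scores do not vary")
    # Sum over items of p * q, the proportion passing times proportion failing.
    pq_sum = 0.0
    for j in range(k):
        p = sum(row[j] for row in responses) / n
        pq_sum += p * (1.0 - p)
    return (k / (k - 1)) * (1.0 - pq_sum / var_total)


# Hypothetical data: five examinees, four items.
scores = [
    [1, 1, 1, 1],
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
]
print(f"KR-20 = {kr20(scores):.2f}")
```

Because the coefficient comes from a single administration of a single
form, it requires neither a second form nor a retest.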

Interestingly, the N.W.R.E.L. description of the Mat-Sea-Cal (p. 105)
lists Kuder-Richardson coefficients of .94 and .91 for the English and the
Spanish tests, respectively. Thus, the reviewers offer a new and intriguing
evaluation system! They describe necessary criteria (p. 37), report
coefficients meeting the criteria (p. 105), yet refuse to award the rating
points (p. 127).

Such minor oversights reach an unpalatable level when N.W.R.E.L.
solicits U.S.O.E. endorsements, to the effect that,

"Administrators, teachers, and other school
personnel involved in planning bilingual/bicultural
programs. . . will find this document an invaluable
aid. . . in providing objective, comprehensive
evaluation of these tests in order to facilitate the
selection of appropriate measurement instruments" (pp. 5-6).

Practicing educators are rarely trained linguists or psychometricians.
They are apt to rely on organizations such as N.W.R.E.L. and U.S.O.E.
for up-to-date, factual information on technical matters. Misinformation,
such as the above, does little to enhance the quality of technical input
on which educational decisions are often based.

Correcting the oversight in the evaluation chart (p. 127) would
credit the Mat-Sea-Cal with two points for internal consistency (the
maximum allowed within this category). For reasons unstated, the reviewers
permit a maximum of three points for alternate form, or test-retest
reliability, while two is the maximum for internal consistency.

Replicability. The reviewers gave the Mat-Sea-Cal no points in
this category. Replicability dealt with whether testing procedures
outlined in the administrator's manual could be duplicated in other
situations. Two items deserve mention in critiquing this category.

First, as defined above, replicability is an administrative
consideration, not a statistical/technical matter. The major portion of this,
the technical excellence, criterion dealt with measurement (specifically,
reliability). In fact, nine of the ten evaluation points in the criterion
were reserved for reliability. Furthermore, replicability appears more
congruent to the next criterion, administrative usability.

Second, items evaluated as replicability (pp. 37-8) are rated
throughout the administrative usability section. For example,
"administrative details" are evaluated in questions #a through #c (of
administrative usability). "Scoring" is rated in items #d through #f.
"Interpretability" is the focus of concerns #h, #i, and #n.
"Standardization" is reviewed in items [illegible] and #j.

In short, replicability is both misplaced, and a duplication of the
ratings in other sections.

Summary (Technical Excellence). On technical excellence the N.W.R.E.L.
reviewers rated the Mat-Sea-Cal as "poor". This rating was a product
of the reviewers' marginal expertise in the concept of reliability, and an
omission related to internal consistency. As a result, the rating as
"poor" on technical excellence is unsubstantiated.

The Mat-Sea-Cal Tests demonstrated high internal consistency
coefficients. A discussion as to the comparability of these coefficients to
alternate form coefficients was provided. Similarly, the inappropriateness
of the test-retest method for an instrument such as the Mat-Sea-Cal
was presented.

Finally, doubt was cast as to whether replicability belonged with
the technical excellence or the administrative usability criterion. Further,
it was noted that considerations within the replicability category were
rated elsewhere.

In conclusion, the Mat-Sea-Cal has demonstrated high internal
consistency coefficients as measures of reliability. The instrument has,
thus, met the major concern of the technical excellence criterion.
Therefore, a rating of "good", not "poor", is justified.

CRITERION: ADMINISTRATIVE USABILITY

Fourteen questions were considered in determining an instrument's
administrative usability. Each question was evaluated independently, and
points were awarded in fourteen separate categories. The point scale
ranged from zero-two with four of the considerations, zero-one for the
remaining ten questions. Again, no discussion was provided for the
difference in scaling range.

The Mat-Sea-Cal was awarded the maximum point value on four
items: training of administrator, number of administrators, range of the
test, and diversity of skills measured. As the maximum point value on
these items was achieved, no additional discussion of them is provided
here. Instead, comments are directed toward the remaining ten
considerations.

Clarity of Manual. The Mat-Sea-Cal was awarded no points in this
category. Aspects considered in evaluating test manuals included:

"discussion of purpose, uses, and limitations
of the test; clear administering and scoring
directions; and description of test
development and validation" (p. 38).

The Mat-Sea-Cal Test administrator's manual is explicit in regard
to purpose, uses, limitations and directions (Matluck and Matluck,
Mat-Sea-Cal Oral Proficiency Tests (Field Test Edition), pp. 2-8).
The manual does not provide a full description of the test's
development and validation.

As a field-test instrument, development and validation of the
Mat-Sea-Cal is not complete. Therefore, the question arises as to whether the
progress data should be reported in the manual; and if so, how often the
manual should be updated. The alternative view holds that preliminary
information may be misleading, and be proven partially inaccurate when
the validation process is completed. This alternative view would hold
for the completion of the development/validation process, when data
would be supplied in their entirety.



The N.W.R.E.L. evaluators prefer the former course, employing a
zero-one scale. Thus, they were prevented from awarding points to the
Mat-Sea-Cal for the information that is contained in its manual.

Had the reviewers selected a broader point scale, a more
informative comparison of instruments would have resulted. Ten of the eleven
tests reviewed in the N.W.R.E.L. publication received no points in this
category (p. 127).

Also, items covered in the evaluation of test manuals are further
scrutinized elsewhere in the review scheme. For example, test
development includes item selection methods (questions #a and #b in
measurement validity). Instrument validation encompasses concurrent and
predictive validity studies (questions #f and #g in measurement validity).
Scoring processes are also reviewed in other categories of
administrative usability (questions #d and #h).

Scoring. The Mat-Sea-Cal received one of two possible points for
ease and objectivity in scoring. The second and third sections of the
test do require the test administrator to listen to, or observe, examinee
responses. The N.W.R.E.L. reviewers felt that such tasks involve a degree
of difficulty which detracts from the scoring ease. The reviewers
favored templates and stencils as methods of scoring and conversion.

However, the topic may be broached as to whether quality in scoring
is necessarily a function of templates, and the like. Providing more
detailed examples of correct and incorrect responses would alleviate
doubts related to the objectivity of the Mat-Sea-Cal's scoring process,
though.

In addition, scoring is rated in questions #b (p. 38) and #h
(p. [illegible]) of administrative usability, and is incorporated into
question #d (pp. 37-8) of technical excellence.

Training. The Mat-Sea-Cal failed to receive points for the question
of who may interpret test scores. Consideration was given as to the
extent special training was required for accurately interpreting test
scores. Stress was placed on regular teaching staff being able to do the
interpretation.

Several questions may be posed as to the merit in this judgment.
For example, is the typical classroom teacher adequately prepared to
determine language dominance or evaluate oral proficiency? Are such
determinations always within the grasp of simple score conversions?
Is a specially qualified test score interpreter intrinsically less
desirable than a classroom teacher manipulating a formula or a template?
(Or, how accurately can a typical classroom teacher manipulate a template
or formula?)

The answers to such questions depend on the constructs being tested,
the depth to which traits are measured, and the implications of decision
making based on data interpretation. Language dominance and oral proficiency
(the Mat-Sea-Cal's domain) can become intricate issues that require
sophisticated interpretation. Decisions based on such data interpretation
will affect the learning activities offered to individual students.
Furthermore, misinterpretation of test data, in addition to being
deleterious to the students, could lead to very nasty legal complications.

By comparison, surveying the extent of home language usage in a
community would be a simpler matter. Survey data can be gathered and
tabulated, and some questions on community language usage answered.
However, conducting surveys also requires considerable expertise,
specifically to insure that data are gathered in an accurate and
an objective manner.

This does not suggest that one purpose is inherently more valuable
than another. It points out, though, that purposes will differ; and
that as they differ, so will the means of language assessment and data
interpretation. In short, the complexities of language assessment and
interpretation are dictated by the depth of information required to meet
stated purposes.

However, evaluating who can interpret scores was not viewed in terms
of the complexities of the constructs measured. Nor was interpretation
considered in respect to the implications that certain test-based decisions
might have.

Score Conversion. The Mat-Sea-Cal received one of two points for
clarity and simplicity in the conversion of raw scores to interpreted
scores. Refinement of the instrument's score conversion techniques
would be a desired product of the validation process.

The concerns rated in this section were also examined in questions
#b (p. 38) and #d (p. 39).

Interpretation. The Mat-Sea-Cal was recipient of no points for
ease of interpretation of test scores. Value was placed on scores that
yielded binary judgments, grade equivalents, percentiles, and the like.

Arguments, here, with the review would be similar to those made
in the section on training. Not all considerations in the linguistics
field reduce to binary, yes-no conclusions (proficiency and dominance,
as examples). Ease of interpretation needs review in terms of constructs
measured.




Normative data (grade equivalents and percentiles) are outcomes of
a completed validation process. For the Mat-Sea-Cal, sample-specific
norms were referenced in Dr. Matluck's A.E.R.A. paper, and its source
documents.

Overlap of evaluation topic is also evident between this category
and items [illegible] (p. 39) and #e (p. 40).

Validating Group. The Mat-Sea-Cal, like every test rated, failed
to receive points for representativeness of the validation group (p. 127).
Five concerns (p. 41) were examined in determining representativeness.



The N.W.R.E.L. reviewers claimed that neither the sample sizes nor
their characteristics were reported in Mat-Sea-Cal studies to date. In
fact, a thorough inspection of documents cited in Dr. Matluck's A.E.R.A.
paper would refute the reviewers' claim. Those sources enumerated
sample sizes, geographical representation, and population
characteristics of the examinees. Data analyses employed by those studies
also examined the sample groups for the effects of such qualitative
variates.

Thus, exception may be taken with the evaluation of the Mat-Sea-Cal
in this category.

Racial, Ethnic, and Sex Representation. The Mat-Sea-Cal failed to
receive either of two points in this category. Four considerations
(p. 41) were used to determine the rating.

Examination of sources quoted by Dr. Matluck's paper would, again,
refute the N.W.R.E.L. rating. Ethnic, racial, and sex characteristics
have been detailed in all Mat-Sea-Cal studies to date.

Thus, the instrument is deserving of a rating of two points within
this category.

Can Decisions Be Made. The Mat-Sea-Cal was awarded one of two
points on the issue of whether test data was useful in making decisions
concerning individual examinees. The reviewers presented examples
of statements (p. 42) which they considered to be evidence of
decision-making prowess. The inclusion of similar statements in test manuals
resulted in favorable ratings.

The reviewers did not specify the extent to which "decision
statements" had been verified by support data (p. 42). Generating
decision statements from test scores strongly implies the presence of
predictive validity. As such decisions affect the educational
opportunities offered to learners, confidence in pursuing the recommended
decisions must result from empirical evidence.

Without requiring support evidence, the reviewers would be
encouraging unsubstantiated hypothesizing in test manuals. On the other
hand, rating predictive validity (in this category) again raises the
question as to what psychometric extent field test and commercial
instruments may be permitted to differ. The expectation is that
commercial measures exhibit more concrete evidence. But is it reasonable
to evaluate field test instruments (and give them lower ratings)
employing the standards used to judge commercial measures?

In any case, predictive-type concerns have been previously rated
elsewhere in the review schematic.



Alternate Forms. This category is an extension of the reviewers'
bias in favoring tests for which alternate forms have been developed.
Comments offered in the technical excellence section (referring to
alternate form and test-retest reliability) would apply equally well
here.

Form Comparability. This section also extends the reviewers'
preference for tests having at least two forms. No provision is made
in the rating scale for one-form instruments except that they are awarded
no points (i.e., their overall administrative usability rating is lowered).

Summary: Administrative Usability. The Mat-Sea-Cal received
seven of eighteen (possible) points for administrative usability. This
resulted in the reviewers classifying the instrument as "poor" on
this criterion.

This evaluation is questionable on three counts. First, the
reviewers failed to pursue source documents in obtaining information
related to certain rating categories. This oversight denied the
instrument rating points, and a more accurate and favorable evaluation, in those
categories. Second, the reviewers essentially rated the same topics
over and over again (alternate forms being the most glaring example).
Third, the reviewers insist on evaluating field test instruments with
the same standards used for commercial measures. This persistence prevents
an evaluation of field test measures relative to their current stage of
test development.

In summary, the Mat-Sea-Cal's rating on administrative usability
may be challenged. Given its present status in the test development
process, it is deserving of at least a rating of "fair".

SYNOPSIS

This paper critiqued the Northwest Lab's review of the Mat-Sea-Cal
Tests, which was presented in the Lab's publication, Oral Language
Tests for Bilingual Students (1976). Several points of difference
between the Lab's reviewers and this author were noted.

Specific differences regarding the Mat-Sea-Cal's overall ratings
on the four evaluation criteria were as follows:

Criterion                      N.W.R.E.L. Rating    Critique's Rating

1. Measurement Validity        poor                 fair

2. Examinee Appropriateness    good                 good

3. Technical Excellence        poor                 good

4. Administrative Usability    poor                 fair-good



Questions were also raised concerning N.W.R.E.L.'s repeated
evaluation of certain items (alternate forms, scoring, administration,
instructions, etc.). Further, the point value range within
evaluation categories varied considerably: zero-four, zero-two, zero-one.
The reviewers never presented a justification or a discussion of these
range differences.

On the technical side, concern was expressed about the reviewers'
collective knowledge of the concept of reliability. Also, the reviewers
did not make a concerted effort to obtain informative source
documents.

In conclusion, two recommendations must be made on the basis of
this critique. First, contrary to the U.S.O.E. endorsement (pp. 5-6),
the N.W.R.E.L. publication is not recommended for educators' use in
selecting oral proficiency measures. The booklet contains too many
omissions and misinterpretations.

Second, an attempt is needed to inform educators of the shortcomings
contained in the N.W.R.E.L. publication. This is necessary so that
the development of the Mat-Sea-Cal Tests, with cooperation from school
districts, will not be hindered by the statements made in the N.W.R.E.L.
review.



